Introduction

This report explores a dataset containing the attributes of 1599 red wines.

Univariate Plots Section

## [1] 1599   13

Our Red Wine dataset contains 1599 instances and 13 variables for each instance.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The 13 variables are “X”, “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol” and “quality”. All variables are numeric: “X” and “quality” are stored as integers (int) and the rest as real numbers (num).
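As a minimal sketch of the checks above, the following rebuilds only the first five rows of a few columns, copied from the str() output, and inspects the shape and column types. The name wine_head is made up for this illustration; the full dataset is not loaded here.

```r
# First five rows of five columns, copied from the str() output above.
# `wine_head` is a hypothetical name used only for this illustration.
wine_head <- data.frame(
  X = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Rows and columns of this small sample (the full data is 1599 x 13).
print(dim(wine_head))

# Storage class of each column: "integer" or "numeric".
col_types <- sapply(wine_head, class)
print(col_types)

# TRUE when every column is numeric in the broad sense.
all_numeric <- all(sapply(wine_head, is.numeric))
print(all_numeric)
```

The same three calls (dim, sapply with class, is.numeric) should apply unchanged to the full data frame.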

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The first “variable” (“X”) is simply the row number of each instance, so it is not really a variable. Some noticeable things are: 1. The lowest value of citric.acid is 0. 2. The maximum of total.sulfur.dioxide (289) is over four times the 3rd quartile (62), so outliers may exist.

The distributions of quality, density and pH look roughly normal. The distribution of quality peaks at 5 and 6, which are within the middle 50% of the data. The distribution of density peaks between 0.996 and 0.998, which is also the center of its range (0.990 to 1.004). The distribution of pH peaks between 3.25 and 3.4, with values ranging from 2.74 to 4.01. The density, pH, and quality of a wine may be correlated. We will calculate the correlations in the Bivariate Plots Section.

The distribution of residual.sugar is highly right-skewed, with the majority of its values between 1 and 3. When we exclude the values above 4, we have a roughly normal distribution. Yet some wines may be very sweet or contain much more sugar, so I will keep all the values for now. The distributions of sulphates and alcohol are also right-skewed. The majority of sulphates values are between 0.25 and 1.0. The distribution of alcohol peaks at about 9.5, and most values lie between 9 and 13, although the full range is 8.4 to 14.9.
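The suspicion about outliers can be made a little more concrete using only the quartiles printed in the summary above. The sketch below compares the maximum with the 3rd quartile and with the usual boxplot upper fence (Q3 + 1.5 × IQR). The fence rule is a common convention that I am bringing in here as an assumption; it is not part of the original analysis.

```r
# Quartiles of total.sulfur.dioxide, copied from the summary() output above.
q1 <- 22
q3 <- 62
max_value <- 289

# How many times larger the maximum is than the 3rd quartile.
ratio <- max_value / q3
print(round(ratio, 2))

# Conventional boxplot upper fence: Q3 + 1.5 * IQR (an assumed rule of thumb).
upper_fence <- q3 + 1.5 * (q3 - q1)
print(upper_fence)

# TRUE means the maximum lies beyond the fence and would be flagged.
print(max_value > upper_fence)
```

By this rule of thumb the maximum is flagged, which agrees with the impression from the summary table.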

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The quality of wine in the dataset ranges from 3 to 8. The median is 6 and the mean is 5.636. The density ranges from 0.9901 to 1.0037, with a median of 0.9968 and a mean of 0.9967. The pH of wine ranges from 2.740 to 4.010, and the median and mean are both about 3.31. For each of these three variables the median and mean are very close, and there are no obvious outliers.

Fixed acidity and volatile acidity may be correlated, so I plotted them and found that they follow roughly similar distributions. Most wines have fixed.acidity between 4 and 14 and volatile.acidity between 0.2 and 1.4.

However, citric.acid does not follow the same distribution as the two acidity variables above. There are many 0.00 values.
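The summary figures for quality can be reproduced from the frequency table printed above. This is a small check using only those six counts; the variable names are my own.

```r
# Quality levels and their counts, copied from the table() output above.
quality_levels <- 3:8
counts <- c(10, 53, 681, 638, 199, 18)

# Total number of wines represented by the table.
n <- sum(counts)

# Weighted mean of quality.
quality_mean <- sum(quality_levels * counts) / n

# Expand the table back into one value per wine to get the median.
quality_values <- rep(quality_levels, times = counts)
quality_median <- median(quality_values)

# Share of wines rated 5 or 6.
share_5_or_6 <- sum(counts[quality_levels %in% c(5, 6)]) / n

print(c(n = n, mean = round(quality_mean, 3), median = quality_median))
print(round(share_5_or_6, 3))
```

The counts sum to 1599 and give the same mean and median as the summary() output, with most wines rated 5 or 6.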

##     X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1   1           7.4            0.700           0            1.9     0.076
## 2   2           7.8            0.880           0            2.6     0.098
## 5   5           7.4            0.700           0            1.9     0.076
## 6   6           7.4            0.660           0            1.8     0.075
## 8   8           7.3            0.650           0            1.2     0.065
## 13 13           5.6            0.615           0            1.6     0.089
##    free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   11                   34  0.9978 3.51      0.56     9.4
## 2                   25                   67  0.9968 3.20      0.68     9.8
## 5                   11                   34  0.9978 3.51      0.56     9.4
## 6                   13                   40  0.9978 3.51      0.56     9.4
## 8                   15                   21  0.9946 3.39      0.47    10.0
## 13                  16                   59  0.9943 3.58      0.52     9.9
##    quality
## 1        5
## 2        5
## 5        5
## 6        5
## 8        7
## 13       5

I displayed the first few rows where citric.acid is 0 and found them normal, with all other attributes available. It is possible that some red wines do not contain any citric acid.

Similarly, free.sulfur.dioxide and total.sulfur.dioxide both measure sulfur dioxide, and the plots show that their distributions are similarly right-skewed. There are a few extreme samples with total.sulfur.dioxide above 250, while most samples are below 175.

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1080 1080           7.9              0.3        0.68            8.3
## 1082 1082           7.9              0.3        0.68            8.3
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1080      0.05                37.5                  278 0.99316 3.01
## 1082      0.05                37.5                  289 0.99316 3.01
##      sulphates alcohol quality
## 1080      0.51    12.3       7
## 1082      0.51    12.3       7

When displaying the detailed data, I find that the two samples have all other features exactly the same except total.sulfur.dioxide: one is 278 and the other is 289. They are very likely mistakes, so we will exclude them from our data analysis.
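Excluding the two suspect rows might look like the sketch below. It uses a four-row stand-in built from values shown in the output above (rows 1, 2, 1080 and 1082), not the full dataset, and the names wine_sample, suspect_ids and wine_clean are hypothetical.

```r
# Four rows copied from the output above: two ordinary wines and the
# two suspect samples. `wine_sample` is a stand-in for the full data frame.
wine_sample <- data.frame(
  X = c(1, 2, 1080, 1082),
  total.sulfur.dioxide = c(34, 67, 278, 289)
)

# Row identifiers of the two samples judged to be likely mistakes.
suspect_ids <- c(1080, 1082)

# Keep every row whose X is not one of the suspect identifiers.
wine_clean <- wine_sample[!(wine_sample$X %in% suspect_ids), ]
print(wine_clean)

# Applied to the full data, the same filter would leave 1599 - 2 rows.
remaining_rows <- 1599 - length(suspect_ids)
print(remaining_rows)
```

Filtering by the X identifier rather than by a threshold on total.sulfur.dioxide removes exactly these two samples and nothing else.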

##       X fixed.acidity volatile.acidity citric.acid residual.sugar
## 152 152           9.2             0.52        1.00            3.4
## 259 259           7.7             0.41        0.76            1.8
##     chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 152     0.610                  32                   69  0.9996 2.74
## 259     0.611                   8                   45  0.9968 3.06
##     sulphates alcohol quality
## 152      2.00     9.4       4
## 259      1.26     9.4       5

Similarly, we found two samples with extremely high chlorides (0.610 and 0.611, while the majority are below 0.2). Yet we treat them as valid values for now. The chlorides values below 0.15 follow an almost normal distribution.

Univariate Analysis

What is the structure of your dataset? There are 1597 red wines in the dataset with 12 features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”) and an “X” variable. “X” is the row number of the wine. All features in this dataset are numerical.

Other observations: pH range: 2.740 ~ 4.010, median and mean are both about 3.31. quality range: 3 ~ 8, median is 6 and mean is 5.64. density range: 0.9901 ~ 1.0037, median is 0.9968 and mean is 0.9967.

What is/are the main feature(s) of interest in your dataset? The main features in the data set are quality and alcohol. I’d like to determine which features are best for predicting the quality of a red wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model to find the quality of a wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Density, acidity, residual.sugar and pH may also affect the quality of the red wine. In addition, the acidity, sulfur dioxide and sulphates may contribute to the pH level of red wine.

Did you create any new variables from existing variables in the dataset? I did not create any new variables because the existing variables are self-explanatory and there is no obvious need for a new variable.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

In the total.sulfur.dioxide column, I found two samples with extremely high values. When I displayed them, I found that they have all other features exactly the same except total.sulfur.dioxide: one is 278 and the other is 289. They are very likely mistakes, so we will exclude them from our data analysis. In the chlorides column, the plot is right-skewed with many outliers beyond 0.20, but if we only look at the main part below 0.20, the distribution is closer to normal. I also excluded “X” because it is a row number rather than a meaningful variable.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256785091  0.67422712
## volatile.acidity       -0.25678509      1.000000000 -0.55123061
## citric.acid             0.67422712     -0.551230612  1.00000000
## residual.sugar          0.11724817      0.008531062  0.13460975
## chlorides               0.09350529      0.060113411  0.20657072
## free.sulfur.dioxide    -0.15358722     -0.007234366 -0.06678113
## total.sulfur.dioxide   -0.11480905      0.091061805  0.01718835
## density                 0.66901320      0.019058755  0.37181600
## pH                     -0.68522731      0.232618233 -0.53954875
## sulphates               0.18283587     -0.262772500  0.31609500
## alcohol                -0.06125771     -0.200071683  0.10576656
## quality                 0.12478955     -0.388954729  0.22294288
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.117248172  0.093505291        -0.153587216
## volatile.acidity        0.008531062  0.060113411        -0.007234366
## citric.acid             0.134609755  0.206570720        -0.066781129
## residual.sugar          1.000000000  0.060344077         0.178818077
## chlorides               0.060344077  1.000000000         0.007648019
## free.sulfur.dioxide     0.178818077  0.007648019         1.000000000
## total.sulfur.dioxide    0.173644280  0.056479559         0.673018994
## density                 0.369731554  0.199266945        -0.017107013
## pH                     -0.076652612 -0.267716681         0.075814084
## sulphates               0.010113585  0.370713239         0.054092715
## alcohol                 0.033472904 -0.219898510        -0.074314971
## quality                 0.005146394 -0.127500330        -0.055278579
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11480905  0.66901320 -0.68522731
## volatile.acidity               0.09106181  0.01905875  0.23261823
## citric.acid                    0.01718835  0.37181600 -0.53954875
## residual.sugar                 0.17364428  0.36973155 -0.07665261
## chlorides                      0.05647956  0.19926694 -0.26771668
## free.sulfur.dioxide            0.67301899 -0.01710701  0.07581408
## total.sulfur.dioxide           1.00000000  0.09166396 -0.05067752
## density                        0.09166396  1.00000000 -0.34796081
## pH                            -0.05067752 -0.34796081  1.00000000
## sulphates                      0.05260413  0.14682779 -0.19935467
## alcohol                       -0.22958904 -0.49406360  0.21084985
## quality                       -0.20758070 -0.17159200 -0.05382786
##                        sulphates     alcohol      quality
## fixed.acidity         0.18283587 -0.06125771  0.124789555
## volatile.acidity     -0.26277250 -0.20007168 -0.388954729
## citric.acid           0.31609500  0.10576656  0.222942880
## residual.sugar        0.01011358  0.03347290  0.005146394
## chlorides             0.37071324 -0.21989851 -0.127500330
## free.sulfur.dioxide   0.05409271 -0.07431497 -0.055278579
## total.sulfur.dioxide  0.05260413 -0.22958904 -0.207580703
## density               0.14682779 -0.49406360 -0.171591997
## pH                   -0.19935467  0.21084985 -0.053827862
## sulphates             1.00000000  0.09575591  0.253822305
## alcohol               0.09575591  1.00000000  0.474207756
## quality               0.25382230  0.47420776  1.000000000

There is no very strong correlation (above 0.90 in absolute value) between any two variables. Yet a few pairs are more closely correlated than the others: free.sulfur.dioxide and total.sulfur.dioxide (0.673), fixed.acidity and citric.acid (0.674), fixed.acidity and pH (-0.685), and fixed.acidity and density (0.669).
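One way to pull the more closely correlated pairs out of a correlation matrix is sketched below. To stay self-contained it uses only a 4 × 4 sub-matrix with values rounded from the output above, and the 0.6 cut-off is an arbitrary threshold I chose for illustration.

```r
# A 4 x 4 slice of the correlation matrix above, rounded to 3 decimals.
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
cor_sub <- matrix(
  c( 1.000,  0.674,  0.669, -0.685,
     0.674,  1.000,  0.372, -0.540,
     0.669,  0.372,  1.000, -0.348,
    -0.685, -0.540, -0.348,  1.000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Look only at the upper triangle so each pair is reported once,
# and keep pairs whose absolute correlation exceeds the cut-off.
cutoff <- 0.6  # illustrative threshold, not taken from the report
idx <- which(upper.tri(cor_sub) & abs(cor_sub) > cutoff, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cor_sub)[idx[, "row"]],
  var2 = colnames(cor_sub)[idx[, "col"]],
  r = cor_sub[idx]
)

# Sort by strength, ignoring the sign.
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
print(strong_pairs)
```

On the full matrix, the same approach should also pick up the free.sulfur.dioxide and total.sulfur.dioxide pair.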

From the data, fixed acidity, residual sugar, total sulfur dioxide, free sulfur dioxide, density, and pH do not seem to have strong correlations with quality, but alcohol, sulphates, and volatile acidity are moderately correlated with quality. I want to look closer at boxplots and scatter plots involving quality and some other variables like alcohol, sulphates, and volatile.acidity.

Interestingly, I find that many factors are moderately and negatively correlated with pH. This makes sense because the more acid a wine contains, the lower its pH will be. But I will focus on quality and explore the relevant plots in detail.

From the boxplot, I can tell that wines of higher quality generally contain more alcohol, except at quality levels 3 and 4. Wines of quality 3 and 4 contain more alcohol than wines of quality 5, which goes against the trend observed at the other quality levels.

Wines of higher quality generally contain more sulphates, and the correlation is about 0.25.

Interestingly, red wines of higher quality actually have lower volatile acidity, and the correlation is about -0.39.

As we expected, free.sulfur.dioxide and total.sulfur.dioxide are fairly strongly correlated (0.67). Wines with more free sulfur dioxide tend to have higher total sulfur dioxide values.

Contrary to our expectation, volatile acidity and fixed acidity are only weakly linearly related (-0.26). Instead, red wines with more fixed acidity tend to have more citric acid and to be denser.

As we expect, more fixed acidity indicates lower pH. However, pH is not a good indicator of whether a wine is of good quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I am interested in studying which factors indicate the quality of a red wine. After basic exploration, I find that alcohol, sulphates, and volatile acidity are moderately correlated with quality. Red wines of better quality tend to have higher alcohol and sulphates values but lower volatile acidity.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are other features that have interesting relationships. For example, free.sulfur.dioxide and total.sulfur.dioxide are fairly strongly correlated (0.67). Wines with more free sulfur dioxide tend to have higher total sulfur dioxide values. Contrary to our expectation, volatile acidity and fixed acidity are only weakly linearly related. Instead, red wines with more fixed acidity tend to have more citric acid and to be denser. As we expect, more fixed acidity indicates lower pH. However, pH is not a good indicator of whether a wine is of good quality.

What was the strongest relationship you found? Not surprisingly, pH and fixed acidity have the strongest relationship, with a correlation of -0.68522731. In terms of our feature of interest, quality, alcohol seems to have the strongest relationship with it (0.47420776).

Multivariate Plots Section

From the analysis above, we know that volatile.acidity and alcohol are both moderately correlated with quality (-0.4 and 0.5 respectively). So I want to explore whether the relationship with quality becomes stronger when the two are put together. In the plot, we can see that at level 3 and level 8, alcohol and volatile.acidity are positively correlated, but at other quality levels this is not true. So alcohol and volatile.acidity could be independent factors.

For the same reason, I am looking into the relationships between sulphates, alcohol, and quality. It turns out that red wines of better quality contain more alcohol and more sulphates, but alcohol and sulphates are negatively correlated at almost all quality levels. My guess is that when a red wine has balanced but high values of both alcohol and sulphates, it is likely to be of good quality.

I am interested in finding out whether sweetness (residual.sugar) affects the alcohol or quality of a wine. The result confirms my previous analysis that residual.sugar is essentially uncorrelated with the quality of a wine. Yet at the same quality level, wines with more residual.sugar have more alcohol.

Similarly, I want to know if the sourness would affect the quality of a wine. The result again shows that pH is not correlated with the quality of wine but pH is positively correlated with alcohol at all quality levels.

Free.sulfur.dioxide and total.sulfur.dioxide, and volatile.acidity and fixed.acidity, sound like related pairs to me. Therefore, I am exploring the percentage within each pair at different quality levels to see whether the percentages are correlated with quality. The first percentage shows no relationship with quality, but the second shows a negative correlation. This makes sense because we know volatile.acidity itself is negatively correlated with quality (-0.4).
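The exact definition of these percentages is not shown in this report, so the sketch below is only one plausible reading: free.sulfur.dioxide divided by total.sulfur.dioxide, and volatile.acidity divided by fixed.acidity. It uses the first five rows from the str() output above, and all names introduced here are hypothetical.

```r
# First five rows of the four columns involved, from the str() output above.
wine_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34),
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7),
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4),
  quality              = c(5L, 5L, 5L, 6L, 5L)
)

# Assumed definitions of the two "percentages" (not confirmed by the report).
free_share <- wine_head$free.sulfur.dioxide / wine_head$total.sulfur.dioxide
volatile_to_fixed <- wine_head$volatile.acidity / wine_head$fixed.acidity

# Show the derived ratios next to quality.
ratios <- data.frame(
  quality = wine_head$quality,
  free_share = round(free_share, 3),
  volatile_to_fixed = round(volatile_to_fixed, 3)
)
print(ratios)
```

With the full data, these ratios could then be compared across quality levels, for example with a boxplot per level.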

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = redwine)
## m2: lm(formula = I(quality) ~ I(alcohol) + sulphates, data = redwine)
## m3: lm(formula = I(quality) ~ I(alcohol) + sulphates + volatile.acidity, 
##     data = redwine)
## 
## =====================================================
##                        m1         m2         m3      
## -----------------------------------------------------
##   (Intercept)       1.889***   1.389***   2.615***   
##                    (0.175)    (0.178)    (0.196)     
##   I(alcohol)        0.359***   0.344***   0.308***   
##                    (0.017)    (0.016)    (0.016)     
##   sulphates                    1.001***   0.685***   
##                               (0.102)    (0.101)     
##   volatile.acidity                       -1.216***   
##                                          (0.097)     
## -----------------------------------------------------
##   R-squared             0.2        0.3        0.3    
##   adj. R-squared        0.2        0.3        0.3    
##   sigma                 0.7        0.7        0.7    
##   F                   462.7      292.9      266.6    
##   p                     0.0        0.0        0.0    
##   Log-likelihood    -1719.0    -1672.5    -1597.5    
##   Deviance            804.9      759.4      691.4    
##   AIC                3443.9     3353.0     3205.1    
##   BIC                3460.1     3374.5     3231.9    
##   N                  1597       1597       1597      
## =====================================================

The variables in the largest model (m3) account for only about 30% of the variance in the quality of red wines. So none of these linear models fits the data very well.
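To see what the fitted model says in practice, here is a small worked example that applies the m3 coefficients from the table above to the first wine in the dataset (alcohol 9.4, sulphates 0.56, volatile.acidity 0.70, observed quality 5). It also checks that the squared correlation between alcohol and quality is roughly the R-squared of m1, as expected for a regression on a single predictor. Only numbers printed in this report are used, and the variable names are my own.

```r
# Coefficients of model m3, copied from the regression table above.
coef_m3 <- c(
  intercept = 2.615,
  alcohol = 0.308,
  sulphates = 0.685,
  volatile.acidity = -1.216
)

# Predictor values of the first wine, from the str() output.
first_wine <- c(alcohol = 9.4, sulphates = 0.56, volatile.acidity = 0.70)

# Linear prediction: intercept plus the sum of coefficient * value.
predicted <- coef_m3[["intercept"]] +
  sum(coef_m3[names(first_wine)] * first_wine)
print(round(predicted, 2))  # the observed quality of this wine is 5

# For a one-predictor linear model, R-squared equals the squared correlation.
r_alcohol <- 0.47420776
r_squared_m1 <- r_alcohol^2
print(round(r_squared_m1, 3))
```

The prediction is close to the observed quality for this particular wine, but with R-squared around 0.3 the fit will be much looser for many others.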

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest? In “density by fixed acidity and quality”, I find that higher quality red wines fall mostly in the lower density range and that higher fixed acidity usually means higher density regardless of the quality. At the same time, I find that red wines of higher quality contain more alcohol at all density levels and at all residual sugar levels.

Were there any interesting or surprising interactions between features? At all quality levels, more fixed acidity usually indicates lower pH, which agrees with basic chemistry. An interesting finding is that residual sugar is not related to the quality of red wine, which contradicts popular sayings like “the wine is too sweet to be a good wine.”

Final Plots and Summary

Plot I:

I am interested in finding the features correlated with the quality of a red wine. First, I find that the distribution of red wine quality is almost normal, with most red wines at quality 5 and 6. The lowest quality level is 3 and the highest is 8 on a scale of 0 to 10.

Plot II:

We already know that red wines of higher quality tend to have more alcohol and that the correlation is 0.5. Yet from this plot we know that the relationship is not consistent. From level 5 to level 8, the positive relationship is most obvious. From level 3 to level 4, it also holds. Yet wines of quality level 4 actually contain more alcohol than wines of quality level 5. Even wines of level 3 contain slightly more alcohol than those of level 5. This goes against my previous analysis. The anomaly could be because there are few examples at level 3 and level 4.

Plot III:

Red wines with higher alcohol are more likely to be of higher quality. At the best quality level 8 and the worst quality level 3, wines with more volatile acidity tend to contain more alcohol. At other quality levels, the correlation between alcohol and volatile.acidity is not as obvious.

Reflection

The red wine dataset contains 1597 valid instances across 12 variables. Two instances were removed because they duplicate each other in every feature except total sulfur dioxide, where both have abnormally high values. The remaining column in the original dataset, “X”, is only the row number of each red wine.

I started by looking at the types of variables. They are all numerical, but quality may also be interpreted as a factor. The other variables are continuous. Then I looked at how many instances we have for each value of each variable. Most variables have a nearly normal distribution in the main part, but many of them (residual.sugar, sulphates, alcohol, fixed.acidity, volatile.acidity, chlorides, free.sulfur.dioxide and total.sulfur.dioxide) have some outliers forming a long tail on the right.

I am interested in studying which factors indicate the quality of a red wine. I find that alcohol, sulphates, and volatile acidity are moderately correlated with quality, with correlations of 0.5, 0.3, and -0.4 respectively. It means that red wines of better quality tend to have higher alcohol and sulphates values but lower volatile acidity. Furthermore, at the best quality level 8 and the worst quality level 3, wines with more volatile acidity tend to contain more alcohol. At other quality levels, the correlation between alcohol and volatile.acidity is not as obvious.

Some limitations of this analysis come from the quality of the data. I did observe some red wines with extremely high values in some features (residual.sugar, for example), but I am not sure whether they are mistakes or valid data. With more background knowledge, I should be able to deal with the data more professionally. For future work, it is worth asking whether we should remove the outliers or investigate them specifically. Each direction may give us interesting results.